Skip to content

Conversation

@shaikmoeed
Copy link

What this PR does / why we need it:
Add support to list/get namespaced TrainingRuntime.

Which issue(s) this PR fixes:

Fixes #88

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shaikmoeed shaikmoeed force-pushed the fix/namespace-trainingruntime-list branch from 24f00a7 to 1740535 Compare October 29, 2025 07:36
@shaikmoeed shaikmoeed changed the title feat(backend): Support namespaced TrainingRuntime in the SDK feat(trainer): Support namespaced TrainingRuntime in the SDK Oct 29, 2025
@kramaranya
Copy link
Contributor

/ok-to-test

@shaikmoeed shaikmoeed force-pushed the fix/namespace-trainingruntime-list branch from 8f0b6d5 to de2ad1b Compare November 3, 2025 13:51
@abhijeet-dhumal
Copy link
Contributor

abhijeet-dhumal commented Nov 3, 2025

Thank you @shaikmoeed for this!
Left some nit-picks..


def get_runtime(self, name: str) -> types.Runtime:
"""Get the the Runtime object"""
"""Get the the Runtime object prefer namespaced, fall-back to cluster-scoped"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Get the the Runtime object prefer namespaced, fall-back to cluster-scoped"""
"""Get the Runtime object prefer namespaced, fall-back to cluster-scoped"""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same change goes for each occurence here

)

cluster_runtime_list = models.TrainerV1alpha1ClusterTrainingRuntimeList.from_dict(
cluster_thread.get(constants.DEFAULT_TIMEOUT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cluster_thread.get(constants.DEFAULT_TIMEOUT)
cluster_thread.get(common_constants.DEFAULT_TIMEOUT)

Copy link
Author

@shaikmoeed shaikmoeed Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Reason for test case failures!

Comment on lines +503 to +533
def create_training_runtime(
name: str,
namespace: str = "default",
) -> models.TrainerV1alpha1TrainingRuntime:
"""Create a mock namespaced TrainingRuntime object (not cluster-scoped)."""
return models.TrainerV1alpha1TrainingRuntime(
apiVersion=constants.API_VERSION,
kind="TrainingRuntime",
metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
name=name,
namespace=namespace,
labels={constants.RUNTIME_FRAMEWORK_LABEL: name},
),
spec=models.TrainerV1alpha1TrainingRuntimeSpec(
mlPolicy=models.TrainerV1alpha1MLPolicy(
torch=models.TrainerV1alpha1TorchMLPolicySource(
numProcPerNode=models.IoK8sApimachineryPkgUtilIntstrIntOrString(2)
),
numNodes=2,
),
template=models.TrainerV1alpha1JobSetTemplateSpec(
metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
name=name,
namespace=namespace,
),
spec=models.JobsetV1alpha2JobSetSpec(replicatedJobs=[get_replicated_job()]),
),
),
)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean to create this in kubernetes/backend_test.py?
this is not a test function and I believe it should be added to the TrainerClient and propagated to the different backends.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to create_train_job, I thought to use create_training_runtime to pass here instead of empty list. Please correct me, if this was not intended?

@shaikmoeed shaikmoeed force-pushed the fix/namespace-trainingruntime-list branch from d1ca707 to 32b18fd Compare November 19, 2025 17:56
@coveralls
Copy link

coveralls commented Nov 19, 2025

Pull Request Test Coverage Report for Build 19512318478

Details

  • 22 of 25 (88.0%) changed or added relevant lines in 3 files are covered.
  • 5 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.04%) to 66.631%

Changes Missing Coverage Covered Lines Changed/Added Lines %
kubeflow/trainer/backends/kubernetes/backend.py 15 18 83.33%
Files with Coverage Reduction New Missed Lines %
kubeflow/trainer/backends/kubernetes/backend.py 5 79.05%
Totals Coverage Status
Change from base Build 19462635145: 0.04%
Covered Lines: 2518
Relevant Lines: 3779

💛 - Coveralls

@shaikmoeed
Copy link
Author

@abhijeet-dhumal, can you review it again during your free time?

Copy link
Contributor

@kramaranya kramaranya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @shaikmoeed!
I'd like all of us to spend more time on designing this properly, since there are a lot of items to consider
/assign @kubeflow/kubeflow-sdk-team

Comment on lines 101 to +112
except multiprocessing.TimeoutError as e:
raise TimeoutError(f"Timeout to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e
raise TimeoutError(
"Timeout to list "
f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s "
f"in namespace: {self.namespace}"
) from e
except Exception as e:
raise RuntimeError(f"Failed to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e
raise RuntimeError(
"Failed to list "
f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s "
f"in namespace: {self.namespace}"
) from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What If only cluster runtimes exist (no TrainingRuntime CRD)? The entire list_runtimes will fail with RuntimeError, which we don't want, right?

Comment on lines 101 to +106
except multiprocessing.TimeoutError as e:
raise TimeoutError(f"Timeout to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e
raise TimeoutError(
"Timeout to list "
f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s "
f"in namespace: {self.namespace}"
) from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if one type times out and the other succeeds? I think we still should return partial results

Comment on lines +103 to +105
"Timeout to list "
f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s "
f"in namespace: {self.namespace}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: cluster scoped runtimes are not really in a namespace

Comment on lines +142 to +146
except Exception as e:
logger.warning(
f"Namespaced TrainingRuntime '{self.namespace}/{name}' not found "
f"({type(e).__name__}: {e}); falling back to cluster-scoped runtime."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if it was for example TimeoutError? We will still silently fallback to cluster scoped runtimes, right? I would suggest only treating not found / missing CRD as fall back.


return result

def get_runtime(self, name: str) -> types.Runtime:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to list_runtimes(), either case will cause a full failure

# The Kind name for the TrainingRuntime.
TRAINING_RUNTIME_KIND = "TrainingRuntime"

# The plural for the ClusterTrainingRuntime.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# The plural for the ClusterTrainingRuntime.
# The plural for the TrainingRuntime.

return result

for runtime in runtime_list.items:
for runtime in namespace_runtime_list.items + cluster_runtime_list.items:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shaikmoeed
Quick question : What if runtimes with the same name exists in both cluster and namespace scoped ?
IIUC you have implemented namespace scoped priority in get_runtime() where trainingRuntimes get's first priority..

thinking should it be same case for list_runtimes method too ?
And one more thing, In case of list_runtimes it's just appending both even if duplicates comes in..
So for end user how user will be able to know the kind of runtime via list_runtimes's list items ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we introduce kind and namespace(optional) params in Runtime dataclass here :

class Runtime:

So that

for runtime in runtimes:
    print(f"{runtime.name} ({runtime.kind}, ns={runtime.namespace})")
    
# Output:
# torch-runtime (TrainingRuntime, ns=team-a)
# torch-runtime (ClusterTrainingRuntime, ns=None)
# custom-runtime (TrainingRuntime, ns=team-a)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support namespaced TrainingRuntime in the SDK

5 participants